Add Isaac-0.2-2B-Preview VLM contrib model #154

Open

jimburtoft wants to merge 4 commits into aws-neuron:main from jimburtoft:contrib/isaac-0.2-2b

Conversation

jimburtoft (Contributor) commented May 1, 2026

Note: The template below includes items meant for model contributions only.

Description

Isaac-0.2-2B-Preview is a 2.57B-parameter vision-language model from PerceptronAI, combining a standard Qwen3 text backbone with a SigLIP2 vision encoder and a 2-layer MLP projector with pixel shuffle. It is onboarded to Neuron via NxDI's NeuronBaseForImageToText framework.

Validated on trn2.3xlarge (LNC=2, TP=1, BF16): text-only first-token cosine similarity of 0.999978 vs. the CPU reference, with 110.7 tok/s text-only and 108.7 tok/s image+text generation.
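For readers unfamiliar with the projector pattern, below is a minimal PyTorch sketch of pixel shuffle followed by a 2-layer MLP. The dimensions, ratio, and module names are illustrative assumptions, not the contrib implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Illustrative sketch only: pixel shuffle folds each r x r patch
    neighborhood into channels (cutting the token count by r^2), then a
    2-layer MLP maps the merged features into the text hidden size.
    vision_dim/text_dim/shuffle_ratio are assumed values."""
    def __init__(self, vision_dim=1152, text_dim=2048, shuffle_ratio=2):
        super().__init__()
        self.r = shuffle_ratio
        merged = vision_dim * shuffle_ratio ** 2
        self.mlp = nn.Sequential(
            nn.Linear(merged, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, x):            # x: [batch, H*W, vision_dim]
        b, n, c = x.shape
        h = w = int(n ** 0.5)        # assume a square patch grid
        x = x.view(b, h, w, c)
        # Split h and w into (h//r, r) and (w//r, r), then fold both r-axes
        # into the channel dim: token count shrinks by r^2.
        x = x.view(b, h // self.r, self.r, w // self.r, self.r, c)
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // self.r) * (w // self.r), c * self.r ** 2)
        return self.mlp(x)           # [batch, n / r^2, text_dim]
```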

Model Information

Model Name: Isaac-0.2-2B-Preview

Model Architecture: Vision-language model (SigLIP2 encoder + pixel shuffle + 2-layer MLP projector + Qwen3 decoder)

HuggingFace: PerceptronAI/Isaac-0.2-2B-Preview

License: CC-BY-NC-4.0

Checklist

Required Components

  • Accuracy Test (test/integration/validate_text_logits.py)

    • Compares Neuron BF16 first-token logits against CPU FP32 reference across 5 text prompts
    • Average cosine similarity: 0.999978
    • Top-1 match: 5/5, Top-5 overlap: 5.0/5, Top-10 overlap: 9.8/10
    • Additional accuracy tests: validate_image_text.py (3 image+text E2E tests), validate_vision_encoder.py, validate_tkg.py (a sketch of the logit comparison appears after this list)
  • README.md with the following sections:

    • Usage Example: Compile and run examples for text-only and image+text inference
    • Compatibility Matrix: Validated on trn2.3xlarge (LNC=2, TP=1/2/4) with SDK 2.29
    • Example Checkpoints: PerceptronAI/Isaac-0.2-2B-Preview
    • Testing Instructions: How to run each test script
    • Benchmark Results: Performance numbers for text-only and image+text
    • Known Limitations: BS>1, NKI kernel constraints, vLLM image+text
  • Source Code (src/isaac_neuron/)

    • modeling_isaac.py: Top-level VLM orchestrator (NeuronBaseForImageToText)
    • modeling_isaac_text.py: Text backbone (NeuronBaseModel wrapping NxDI Qwen3 layers)
    • modeling_isaac_vision.py: Vision encoder wrapper (SigLIP2 + pixel shuffle + MLP projector)
    • siglip/modeling_siglip.py: SigLIP2 encoder (adapted from Gemma3-vision contrib)
    • siglip/layers.py: Parallel Conv2d for vision patch embedding
    • ndxi_patch.py: SDK 2.29 compatibility patches
    • utils.py: Shared utilities
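
For reference, here is a minimal sketch of the kind of comparison the accuracy test performs: cosine similarity and top-k overlap between Neuron and CPU first-token logits. The function and argument names are placeholders, assuming two precomputed logit vectors rather than the actual test script.

```python
import torch

def compare_first_token_logits(neuron_logits: torch.Tensor,
                               cpu_logits: torch.Tensor, k: int = 5):
    """Sketch of the metrics described above, not the contrib test itself.
    Inputs are assumed to be 1-D [vocab_size] first-token logit tensors from
    the Neuron (BF16) and CPU (FP32) runs."""
    cos = torch.nn.functional.cosine_similarity(
        neuron_logits.float().flatten(), cpu_logits.float().flatten(), dim=0)
    # Top-k overlap: how many of the k highest-scoring token IDs agree.
    top_neuron = set(neuron_logits.float().topk(k).indices.tolist())
    top_cpu = set(cpu_logits.float().topk(k).indices.tolist())
    overlap = len(top_neuron & top_cpu)
    top1_match = neuron_logits.argmax().item() == cpu_logits.argmax().item()
    return cos.item(), overlap, top1_match
```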

Optional Components

  • Integration Tests (test/integration/)

    • validate_text_logits.py: First-token logit accuracy (CPU vs Neuron)
    • validate_tkg.py: Token generation quality and throughput
    • validate_image_text.py: End-to-end multimodal generation
    • validate_vision_encoder.py: Vision encoder output validation
    • test_tp.py: Tensor parallelism at TP=1, 2, 4
    • test_kernels.py: NKI kernel compatibility sweep
    • test_scaling.py: Sequence length scaling (1024-8192)
    • test_weight_loading.py: State dict key mapping validation
    • benchmark.py: Formal benchmark harness (10 iterations, 3 warmup)
    • run_isaac.py: Quick compile + run utility
  • vLLM Integration (vllm/)

    • patch_vllm_isaac.py: Automated 3-file vllm-neuron patch script
    • run_offline_inference.py: Offline inference example
    • run_online_inference.py: OpenAI-compatible API client
    • start-vllm-server.sh: Server launch script
    • README.md: Setup and usage documentation
    • Status: Text-only serving works (~78 tok/s). Image+text has a known pixel_values format mismatch. (An offline-inference sketch appears after this list.)
  • GPU Benchmark (gpu_benchmark/)

    • benchmark_gpu.py: L40S benchmark script (vLLM 0.20.0, CUDA graphs enabled)
    • gpu_benchmark_results.json: Full results (4 workloads)
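
For orientation, the offline-inference path mentioned above typically follows the standard vLLM API shown below. The model path, prompt, and sampling settings are placeholders; the actual example is vllm/run_offline_inference.py, which also applies the Neuron patches first.

```python
from vllm import LLM, SamplingParams

# Placeholder model path; the contrib script patches vllm-neuron before this.
llm = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview")
params = SamplingParams(temperature=0.0, max_tokens=128)

# Text-only serving is the verified path (~78 tok/s per the status above).
outputs = llm.generate(["Describe what a vision-language model does."], params)
for out in outputs:
    print(out.outputs[0].text)
```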

Folder Structure

/contrib/models/Isaac-0.2-2B/
  README.md
  /src/isaac_neuron/
    __init__.py
    modeling_isaac.py
    modeling_isaac_text.py
    modeling_isaac_vision.py
    ndxi_patch.py
    utils.py
    /siglip/
      __init__.py
      modeling_siglip.py
      layers.py
  /test/
    __init__.py
    /integration/
      __init__.py
      benchmark.py
      run_isaac.py
      test_kernels.py
      test_scaling.py
      test_tp.py
      test_weight_loading.py
      validate_image_text.py
      validate_text_logits.py
      validate_tkg.py
      validate_vision_encoder.py
  /vllm/
    README.md
    add_execute_model.py
    patch_vllm_isaac.py
    run_offline_inference.py
    run_online_inference.py
    start-vllm-server.sh
  /gpu_benchmark/
    benchmark_gpu.py
    gpu_benchmark_results.json
    nuke_perceptron_import.py
    patch_gpu_modular.py
    setup_gpu.sh
    fix_indent.py

Testing

How did you test this change?

All tests run on trn2.3xlarge (LNC=2, TP=1) with Neuron SDK 2.29 (DLAMI 20260410, NxDI 0.9.17334).

  1. Accuracy: First-token logits compared against CPU FP32 reference — avg cosine 0.999978 across 5 prompts
  2. Text generation: 5 text-only prompts generate coherent output at 94-111 tok/s
  3. Image+text: 3 multimodal prompts generate correct image descriptions at 104-108 tok/s
  4. Tensor parallelism: TP=1, 2, 4 all compile and pass accuracy gates (cosine 0.9999+)
  5. Sequence scaling: seq_len 1024-8192 all compile and run correctly
  6. NKI kernels: CTE flash attention works; MLP/QKV kernels documented as incompatible at TP=1
  7. vLLM: Text-only serving verified via offline inference (~78 tok/s)
  8. GPU comparison: L40S benchmark via vLLM 0.20.0 with CUDA graphs
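
The throughput and TPOT figures below come from the 10-iteration, 3-warmup harness in test/integration/benchmark.py. A generic sketch of that measurement pattern, with generate_fn standing in for the actual NxDI generation call:

```python
import time

def measure_decode_throughput(generate_fn, n_iters=10, n_warmup=3, new_tokens=256):
    """Generic warmup-then-measure loop matching the 10-iteration / 3-warmup
    setup described above; generate_fn is a placeholder callable that decodes
    new_tokens tokens. The real harness lives in test/integration/benchmark.py."""
    for _ in range(n_warmup):             # warmup absorbs compile/cache effects
        generate_fn(new_tokens)
    times = []
    for _ in range(n_iters):
        start = time.perf_counter()
        generate_fn(new_tokens)
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)
    tok_per_s = new_tokens / avg
    tpot_ms = avg / new_tokens * 1000.0
    return tok_per_s, tpot_ms
```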

Benchmark Results (trn2.3xlarge, TP=1, BF16, seq_len=1024, 10 iterations):

Mode             Throughput    TPOT
Text-only        110.7 tok/s   9.0ms
Image+text       108.7 tok/s   9.2ms
Projected DP=4   ~443 tok/s    -
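As a sanity check on the table: TPOT is roughly the reciprocal of decode throughput, and the DP=4 row is a linear projection from the TP=1 measurement.

```python
tok_s = 110.7              # measured text-only throughput at TP=1 (from the table)
print(1000.0 / tok_s)      # ~9.03 ms, matching the 9.0ms TPOT row
print(4 * tok_s)           # ~442.8 tok/s, the "~443 tok/s" DP=4 projection
```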

GPU Comparison (L40S, BF16, vLLM 0.20.0, CUDA graphs enabled):

Metric               L40S GPU    trn2 Neuron (TP=1)   trn2 Neuron (DP=4)
TPOT (short input)   5.75ms      9.0ms                -
Throughput (short)   174 tok/s   111 tok/s            ~443 tok/s
TPOT (long input)    6.09ms      9.0ms                -
Throughput (long)    164 tok/s   111 tok/s            ~443 tok/s

The L40S GPU is about 1.5x faster per core than a single NeuronCore. At the device level (DP=4), trn2.3xlarge is about 2.5x faster than the L40S.

NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM (from the commit messages):

  • Qwen3 text backbone with SigLIP2 vision encoder
  • 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
  • Supports TP=1/2/4, seq_len up to 8192
  • 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
  • 9.0ms TPOT at seq_len=1024
  • BF16, CTE flash attention enabled
  • Validated: cosine 0.9999+ vs CPU reference across all configs
  • vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
  • GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x; superseded by the corrected numbers below)
  • modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
  • execute_model override for logits-to-token-ID conversion
  • Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)

Benchmark correction: the previous GPU benchmark used enforce_eager=True, which handicapped the L40S to 52 tok/s. With CUDA graphs, torch.compile, and FlashAttention v2 enabled, the L40S achieves 174 tok/s: about 1.5x faster per core than a single NeuronCore, while trn2 at DP=4 remains about 2.5x faster at the device level.
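
The eager-mode handicap described above corresponds to vLLM's enforce_eager flag; leaving it at its default (False) re-enables CUDA graph capture. A sketch of the difference, with the model path as a placeholder (the two constructions are shown side by side for illustration, not to run in one process):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture (the earlier 52 tok/s run).
llm_eager = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview", enforce_eager=True)

# Default (False) lets vLLM capture CUDA graphs, as in the 174 tok/s rerun.
llm_graphs = LLM(model="PerceptronAI/Isaac-0.2-2B-Preview", enforce_eager=False)
```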